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Abstract 

We propose an adaptive diffusion mechanism to optimize a global cost function in a distributed 
manner over a network of nodes. The cost function is assumed to consist of a collection of individual 
components. Diffusion adaptation allows the nodes to cooperate and diffuse information in real-time; it 
also helps alleviate the effects of stochastic gradient noise and measurement noise through a continu- 
ous learning process. We analyze the mean-square-error performance of the algorithm in some detail, 
including its transient and steady-state behavior We also apply the diffusion algorithm to two problems: 
distributed estimation with sparse parameters and distributed localization. Compared to well-studied 
incremental methods, diffusion methods do not require the use of a cyclic path over the nodes and 
are robust to node and link failure. Diffusion methods also endow networks with adaptation abilities 
that enable the individual nodes to continue learning even when the cost function changes with time. 
Examples involving such dynamic cost functions with moving targets are common in the context of 
biological networks. 



Index Terms 

Distributed optimization, diffusion adaptation, incremental techniques, learning, energy conservation, 
biological networks, mean-square performance, convergence, stability. 



I. Introduction 

We consider the problem of optimizing a global cost function in a distributed manner. The cost function 
is assumed to consist of the sum of individual components, and spatially distributed nodes are used to 
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seek the common minimizer (or maximizer) through local interactions. Such problems abound in the 
context of biological networks, where agents collaborate with each other via local interactions for a 
common objective, such as locating food sources or evading predators ||3|. Similar problems are common 
in distributed resource allocation applications and in online machine learning procedures. In the latter 
case, data that are generated by the same underlying distribution are processed in a distributed manner 
over a network of learners in order to recover the model parameters (e.g., Q, ||5|). 

There are already a few of useful techniques for the solution of optimization problems in a distributed 
manner ||6|-|24|. Most notable among these methods is the incremental approach |[6|-|[9| and the con- 



sensus approach |10|-|23|. In the incremental approach, a cyclic path is defined over the nodes and 
data are processed in a cyclic manner through the network until optimization is achieved. However, 



determining a cyclic path that covers all nodes is known to be an NP-hard problem 1 25 1 and, in addition, 
cyclic trajectories are prone to link and node failures. When any of the edges along the path fails, the 
sharing of data through the cyclic trajectory is interrupted and the algorithm stops performing. In the 
consensus approach, vanishing step-sizes are used to ensure that nodes reach consensus and converge 
to the same optimizer in steady-state. However, in time-varying environments, diminishing step-sizes 
prevent the network from continuous learning and optimization; when the step-sizes die out, the network 



stops learning. In earlier publications |26|-|35|, and motivated by our work on adaptation and learning 
over networks, we introduced the concept of diffusion adaptation and showed how this technique can be 
used to solve global minimum mean-square-eiTor estimation problems efficiently both in real-time and 
in a distributed manner. In the diffusion approach, information is processed locally and simultaneously 
at all nodes and the processed data are diffused through a real-time sharing mechanism that ripples 
through the network continuously. Diffusion adaptation was applied to model complex patterns of behavior 
encountered in biological networks, such as bird flight formations | |36| and fish schooling [3]. Diffusion 
adaptation was also applied to solve dynamic resource allocation problems in cognitive radios |37ij. 



to perform robust system identification |38|, and to implement distributed online learning in pattern 
recognition applications Q. 

This paper generalizes the diffusive learning process and applies it to the distributed optimization 
of a wide class of cost functions. The diffusion approach will be shown to alleviate the effect of 
gradient noise on convergence. Most other studies on distributed optimization tend to focus on the 
almost-sure convergence of the algorithms under diminishing step-size conditions Q, ||7|, p9|-|43 1, or 
on convergence under deterministic conditions on the data |[6|-|[8|, 1 15 1. In this article we instead examine 
the distributed algorithms from a mean-square-error perspective at constant step-sizes. This is because 
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constant step-sizes are necessary for continuous adaptation, learning, and tracking, which in turn enable 
the resulting algorithms to perform well even under data that exhibit statistical variations, measurement 
noise, and gradient noise. 

This paper is organized as follows. In Sec. [llj we introduce the global cost function and approximate 
it by a distributed optimization problem through the use of a second-order Taylor series expansion. In 
Sec. Illl| we show that optimizing the localized alternative cost at each node k leads naturally to diffusion 



adaptation strategies. In Sec. IV we analyze the mean-square performance of the diffusion algorithms 
under statistical perturbations when stochastic gradients are used. In Sec. |V| we apply the diffusion 
algorithms to two application problems: sparse distributed estimation and distributed localization. Finally, 



in Sec. VI we conclude the paper. 

Notation. Throughout the paper, all vectors are column vectors except for the regressors {uk^i}, which 
are taken to be row vectors for simplicity of notation. We use boldface letters to denote random quantities 
(such as Uk i) and regular font letters to denote their realizations or deterministic variables (such as j). 
We write E to denote the expectation operator. We use diagjxi, . . . ,xn} to denote a diagonal matrix 
consisting of diagonal entries xi, . . . ,xn, and use col{xi, . . . ,xj\[} to denote a column vector formed 
by stacking xi, . . . ,X]\[ on top of each other. For symmetric matrices X and Y, the notation X < Y 
denotes Y — X > 0, namely, that the matrix difference Y — X is positive semi-definite. 

II. Problem Formulation 

The objective is to determine, in a collaborative and distributed manner, the M x 1 column vector w° 
that minimizes a global cost of the form: 

N 

1=1 



(1) 



where Ji{w), / = 1, 2, . . . , A^, are individual real-valued functions, defined over w £ and assumed to 
be differentiable and strictly convex. Then, J^^"^{w) in ([T]) is also strictly convex so that the minimizer 



w° is unique |44|. In this article we study the important case where the component functions {Ji{w)} are 
minimized at the same w°. This case is common in practice; situations abound where nodes in a network 
need to work cooperatively to attain a common objective (such as tracking a target, locating the source 
of chemical leak, estimating a physical model, or identifying a statistical distribution). This scenario is 
also frequent in the context of biological networks. For example, during the foraging behavior of an 
animal group, each agent in the group is interested in determining the same vector w" that corresponds 
to the location of the food source or the location of the predator [3|. This scenario is equally common 
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in online distributed machine learning problems, where data samples are often generated from the same 
underlying distribution and they are processed in a distributed manner by different nodes (e.g., Q, |[5|). 
The case where the {Ji{w)} have different individual minimizers is studied in [45 1; this situation is more 



challenging to study. Nevertheless, it is shown in |45| that the same diffusion strategies ([T8|)-([T9l) of this 
paper are still applicable and nodes would converge instead to a Pareto-optimal solution. 

Our strategy to optimize the global cost J^^°^{w) in a distributed manner is based on three steps. 
First, using a second-order Taylor series expansion, we argue that J^^°^{w) can be approximated by an 



alternative localized cost that is amenable to distributed optimization — see ( [TT] ). Second, each individual 
node optimizes this alternative cost via a steepest-descent procedure that relies solely on interactions 
within the neighborhood of the node. Finally, the local estimates for w° are spatially combined by each 
node and the procedure repeats itself in real-time. 

To motivate the approach, we start by introducing a set of nonnegative coefficients {c; fc} that satisfy: 



N 



fe=l 



1, 



Oif/^TVfc, / = 1,2, ...,iV 



(2) 



where J\fk denotes the neighborhood of node k (including node k itself); the neighbors of node k consist 
of all nodes with which node k can share information. Each c; ^ represents a weight value that node 
k assigns to information arriving from its neighbor /. Condition Q states that the sum of all weights 
leaving each node / should be one. Using the coefficients {c/^fc}, we can express 

jgiob^^) from ^ as 



N 



where 



(3) 



(4) 



In other words, for each node k, we are introducing a new local cost function, jj^'^{w), which corresponds 
to a weighted combination of the costs of its neighbors. Since the {ci^k} are all nonnegative and each 
Ji{w) is convex, then jj^'^{w) is also a convex function (actually, the J^'^{w) will be guaranteed to be 
strongly convex in our treatment in view of Assumption [T] further ahead). 

Now, each j'°'^(w) in the second term of Q can be approximated via a second-order Taylor series 
expansion as: 



t'^iw) ^ jriw°) + \\w 



loc/ 



■w 



o||2 



(5) 
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Fig. 1. A network with A'^ nodes; a cost function Jk{w) is associated witii each node k. The set of neighbors of node k is 
denoted by A/fc; this set consists of all nodes with which node k can share information. 

where F; = is the (scaled) Hessian matrix relative to w and evaluated at w = w°, and the 

notation ||a||| denotes a^Sa for any weighting matrix S. The analysis in the subsequent sections will 
show that the second-order approximation Q is sufficient to ensure mean-square convergence of the 
resulting diffusion algorithm. Now, substituting Q into the right-hand side of (|3]) gives: 

« J^H+j; \\w-w''\\l+Y,j\'''{w°) (6) 
l^k li^k 

The last term in the above expression does not depend on the unknown w. Therefore, we can ignore it 
so that optimizing J^'^'^ivS) is approximately equivalent to optimizing the following alternative cost: 

Jgl°b'(^) A jloc(^)^^||^_^o||2^ 

III. Iterative Diffusion Solution 

Expression (|7]l relates the original global cost ([T]) to the newly-defined local cost function jj^'^{w). 
The relation is through the second term on the right-hand side of (|7]), which corresponds to a sum of 
quadratic terms involving the minimizer w°. Obviously, w" is not available at node k since the nodes wish 
to estimate w". Likewise, not all Hessian matrices Ti are available to node k. Nevertheless, expression ^ 
suggests a useful approximation that leads to a powerful distributed solution, as we proceed to explain. 

Our first step is to replace the global cost [w) by a reasonable localized approximation for it at 
every node k. Thus, initially we limit the summation on the right-hand side of ([T]) to the neighbors of 
node k and introduce the cost function: 

leAfk\{k} 
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Compared with (|7]), the last term in ([8]) involves only quantities that are available in the neighborhood of 
node k. The argument involving steps ([5])-([8]) therefore shows us one way by which we can adjust the 
earlier local cost function J^'^{w) defined in Q by adding to it the last term that appears in ([8]l. Doing 
so, we end up replacing J^'^{w) by Jf'"*^ (w), and this new localized cost function preserves the second 
term in Q up to a second-order approximation. This correction will help lead to a diffusion step (see 
([ll)-©). 

Now, observe that the cost in ^ includes the quantities {F/}, which belong to the neighbors of node k. 
These quantities may or may not be available. If they are known, then we can proceed with ([8]) and rely 
on the use of the Hessian matrices in the subsequent development. Nevertheless, the more interesting 
situation in practice is when these Hessian matrices are not known beforehand (especially since they 
depend on the unknown w°). For this reason, in this article, we approximate each F^ in ([8]) by a multiple 
of the identity matrix, say, 

F/ « bi,klM (9) 

for some nonnegative coefficients {&z,fc}; observe that we are allowing the coefficient bi^k to vary with 
the node index k. Such approximations are common in stochastic approximation theory and help reduce 
the complexity of the resulting algorithms — see 1441 pp.20-28] and ||46l pp.142-147]. Approximation 



([9]) is reasonable since, in view of the Rayleigh-Ritz characterization of eigenvalues |47|, we can always 
bound the weighted squared norm — it;°||p^ by the unweighted squared norm as follows 

Amin(F/) • < ||'U;-'U;°||r, < Amax(F/) • 

Thus, we replace ^ by 

leAfk\{k} 

As the derivation will show, we do not need to worry at this stage about how the scalars {bi ^} are 
selected; they will be embedded into other combination weights that the designer selects. If we replace 
J^^'^{w) by its definition Q, we can rewrite ( fTO] ) as 



(11) 



Observe that cost ( [TT] ) is different for different nodes; this is because the choices of the weighting scalars 
{ci,k,bi^k} vary across nodes k; moreover, the neighborhoods vary with k. Nevertheless, these localized 
cost functions now constitute the important starting point for the development of diffusion strategies for 
the online and distributed optimization of ([T]). 

May 15, 2012 DRAFT 



Each node k can apply a steepest-descent iteration to minimize J^^°^ (w) by moving along the negative 
direction of the gradient (column) vector of the cost function, namely, 

Wk,i = Wk,i-1 - IJ'k'Yl, '^l,k^wJl{wk,i-l) 

- I^k ^ '^bi,k(.^k,i-i - w"), i>0 (12) 
where w^^i denotes the estimate for w° at node k at time i, and jj.^ denotes a small constant positive 



step-size parameter. While vanishing step-sizes, such as /ife(i) = l/i, can be used in ([12]), we consider in 
this paper the case of constant step-sizes. This is because we are interested in distributed strategies that 
are able to continue adapting and learning. An important question to address therefore is how close each 
of the Wk,i gets to the optimal solution w"; we answer this question later in the paper by means of a 



mean- square-error convergence analysis (see expression ([88|)). It will be seen then that the mean-square- 
error (MSE) of the algorithm will be of the order of the step-size; hence, sufficiently small step-sizes 
will lead to sufficiently small MSEs. 

Expression ( [T2] ) adds two correction terms to the previous estimate, Wk^i-i, in order to update it to 
Wk^i- The correction terms can be added one at a time in a succession of two steps, for example, as: 

V'M = Wk,i-l - l^k^Yl ^l,k'^wJl{Wk,i-l) (13) 

Wk,i = ipk,i - fJ'k ^ '^bi,k{^k,i--i - w°) (14) 

l(iNk\{k} 

Step ( [T3| ) updates Wk^i-i to an intermediate value t/j^^i by using a combination of local gradient vectors. 
Step ( [T4| ) further updates ^pk^i to Wk^i by using a combination of local estimates. However, two issues 
arise while examining ( [T?] ): 



(a) First, iteration ( [T4| ) requires knowledge of the optimizer w". However, all nodes are running similar 
updates to estimate the w". By the time node k wishes to apply ([14]), each of its neighbors would 
have performed its own update similar to ( [T3] ) and would have available their intermediate estimates, 
{ipi^i}- Therefore, we replace w" in ( [14] ) by V'i.j- This step helps diffuse information over the network 
and brings into node k information that exists beyond its immediate neighborhood; this is because 
each ipi^i is influenced by data from the neighbors of node We observe that this diffusive term 
arises from the quadratic approximation ([5]) we have made to the second term in ([3]). 



(b) Second, the intermediate value ipk,i in ( ]T3) is generally a better estimate for w° than Wk,i-i 



since it is obtained by incorporating information from the neighbors through (13i. Therefore, we 
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further replace Wk,i-i in ( [T4| ) by ^^ j. This step is reminiscent of incremental-type approaches to 
optimization, which have been widely studied in the literature |[6|-|[9|. 



Performing the substitutions described in items (a) and (b) into ( [14] ), we obtain 
Now introduce the coefficients 

A 



ipk,i-l^k ^ '2bi^k{i^k,i - ipi,i) 

leAfk\{k} 



ai,k = '^fJ'kbi,k {l^k), ak,k 



(15) 



(16) 



ieAf^\{k} 

Note that the {ai k} are nonnegative for I k and ^ > for sufficiently small step-sizes. Moreover, 
the coefficients {a/,fc} satisfy 

N 

J2 ai,k = 1, ai^k = if / ^ (17) 



1=1 



Using ([16]) in (15), we arrive at the following Adapt- then-Combine (ATC) diffusion strategy (whose 
structure is the same as the ATC algorithm originally proposed in 
estimation): 



1 3 1 1 for mean-square-error 



i^k,i = Wk^i-l - fJ-k'^ Cl^k^wJl{Wk,i-l) 

ai,ki>l,i 



(18) 



Wk 



To run algorithm ( 18 ), we only need to select combination coefficients {a; fc, q satisfying (|2]) and ( [T7] ), 
respectively; there is no need to worry about the intermediate coefficients {bi^k} any more, since they 



have been blended into the {a^k}- The ATC algorithm (1^) involves two steps. In the first step, node k 
receives gradient vector information from its neighbors and uses it to update its estimate Wk^i-i to an 
intermediate value V'fc.i- All other nodes in the network are performing a similar step and generating their 
intermediate estimate V'z.i- In the second step, node k aggregates the estimates {ipi^i} of its neighbors and 
generates Wk^i- Again, all other nodes are performing a similar step. Similarly, if we reverse the order of 
steps ( [T3] ) and ( [T4] ) to implement ( [T2] ), we can motivate the following alternative Combine-then-Adapt 



(CTA) diffusion strategy (whose structure is similar to the CTA algorithm originally proposed in |26|-|32| 
for mean-square-error estimation): 



tpk,ir-i = ^ ai,kWi,i-i 

Wk,i = ■4'k,i^l - /"fc X] ^l,k^wJl{'4'k,i-l) 



(19) 
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Adaptive diffusion strategies of the above ATC and CTA types were first proposed and extended in 126]- 
p4) for the solution of distributed mean-square-error, least-squares, and state-space estimation problems 



over networks. The special form of ATC strategy ( |T8| ) for minimum-mean-square-error estimation is listed 
further ahead as Eq. ( [57] ) in Example [3| the same strategy as (57i also appeared in [48] albeit with a 
vanishing step-size sequence to ensure convergence towards consensus. A special case of the diffusion 
strategy ^19\ (corresponding to choosing ci^k = for / 7^ k and Ck,k = 1> without sharing gradient 



information) was used in the works |39|, |40|, |43| to solve distributed optimization problems that require 
all nodes to reach agreement about w° by relying on step-sizes that decay to zero with time. Diffusion 
recursions of the forms ( [TS] ) and ([T9]l are more general than these earlier investigations in a couple of 
respects. First, they do not only diffuse the local estimates, but they can also diffuse the local gradient 
vectors. In other words, two sets of combination coefficients {a; fc, q /;} are used. Second, the combination 
weights {ai^k} are not required to be doubly stochastic (which would require both the rows and columns 
of the weighting matrix A = [a; to add up to one; as seen from (fTTj), we only require the entries on 



the columns of A to add up to one). Finally, and most importantly, the step-size parameters {i^k} in ( [T8] ) 
and ( [T9I ) are not required to depend on the time index i and are not required to vanish as i — 00. Instead, 
they can assume constant values, which is critical to endow the network with continuous adaptation and 
learning abilities (otherwise, when step-sizes die out, the network stops learning). Constant step-sizes 
also endow networks with tracking abilities, in which case the algorithms can track time changes in the 
optimal w°. 

Constant step-sizes will be shown further ahead to be sufficient to guarantee agreement among the 
nodes when there is no noise in the data. However, when measurement noise and gradient noise are 
present, using constant step-sizes does not force the nodes to attain agreement about w° (i.e., to converge 
to the same w°). Instead, the nodes will be shown to tend to individual estimates for w" that are within 
a small mean-square-error (MSE) bound from the optimal solution; the bound will be proportional to the 
step-size so that sufficiently small step-sizes lead to small MSE values. Multi-agent systems in nature 
behave in this manner; they do not require exact agreement among their agents but allow for fluctuations 
due to individual noise levels (see |36|). Giving individual nodes this flexibility, rather than forcing 
them to operate in agreement with the remaining nodes, ends up leading to nodes with enhanced learning 
abilities. 



Before proceeding to a detailed analysis of the performance of the diffusion algorithms ([T8|)-([T9|), we 
note that these strategies differ in important ways from traditional consensus-based distributed solutions. 
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which are of the following form (TO}, (14), |[T5|, (18): 



Wk,i = ^ a/,fc«^fc,i-i - f^k{i) • '^wMwk,i-i) (20) 

usually with a time-variant step-size sequence, fiki^)^ that decays to zero. For example, if we set C = 
[Q,k] = / in the CTA algorithm ^T9\ and substitute the combination step into the adaptation step, we 
obtain: 



(^i,kWk,i-i - IJ-k'^wJi ( ^ ai^kWk,i-i I (21) 



Thus, note that the gradient vector in ( \2T\ is evaluated at ipk,i-i, while in pO| ) it is evaluated at Wk,i-i 



Since ipk,i-i already incorporates information from neighbors, we would expect the diffusion algorithm 



to perform better. Actually, it is shown in |49| that, for mean- square-error estimation problems, diffusion 
strategies achieve higher convergence rate and lower mean-square-error than consensus strategies due to 
these differences in the dynamics of the algorithms. 

IV. Mean-Square Performance Analysis 
The diffusion algorithms (J^i and ( [191 ) depend on sharing local gradient vectors V^J;(-). In many 



cases of practical relevance, the exact gradient vectors are not available and approximations are instead 
used. We model the inaccuracy in the gradient vectors as some random additive noise component, say, 
of the form: 

VwJiiw) = VwJiiw) + vi^i{w) (22) 

where i>z,t(-) denotes the perturbation and is often referred to as gradient noise. Note that we are using 
a boldface symbol v to refer to the gradient noise since it is generally stochastic in nature. 

Example 1. Assume the individual cost Ji{w) at node I can be expressed as the expected value of a 
certain loss function Qi{-, •), i.e., Ji{w) = ¥,{Qi{w,xi^i)}, where the expectation is with respect to the 
randomness in the data samples {xi^i} that are collected at node / at time i. Then, if we replace the true 
gradient V^J;(w) with its stochastic gradient approximation V^J/(w) = VwQi{w,xi^i), we find that 
the gradient noise in this case can be expressed as 

vi,i{w) = VwQi{w, xi^i) - VyM{Qi{w, xi^i)} (23) 
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Using the perturbed gradient vectors (22 1, the diffusion algorithms (28i-(19l become the following: 

(ATC) ^^^^ (24) 

Wk,i = ^ ai^ki'i,! 



(CTA) 





ai,k'WLi-i 






Wk,i = 'ipk,i-i 


-fJ'k ^ Q,fcVt„Ji(-0fc,i-l) 







(25) 



Observe that, starting with (|24])-([25]), we will be using boldface letters to refer to the various estimate 
quantities in order to highlight the fact that they are also stochastic in nature due to the presence of the 
gradient noise. 

Given the above algorithms, it is necessary to examine their performance in light of the approximation 
steps (|6]l-(15l that were employed to arrive at them, and in light of the gradient noise p2l ) that seeps 
into the recursions. A convenient framework to carry out this analysis is mean-square analysis. In this 
framework, we assess how close the individual estimates Wk^i get to the minimizer w° in the mean-square- 
error (MSE) sense. In practice, it is not necessary to force the individual agents to reach agreement and 
to converge to the same w° using diminishing step-sizes. It is sufficient for the nodes to converge within 
acceptable MSE bounds from w". This flexibility is beneficial and is common in biological networks; it 
allows nodes to learn and adapt in time-varying environments without the forced requirement of having 
to agree with neighbors. 

The main results that we derive in this section are summarized as follows. First, we derive conditions on 
the constant step-sizes to ensure boundedness and convergence of the mean-square-error for sufficiently 
small step-sizes — see (80 1 and ( |106[ ) further ahead. Second, despite the fact that nodes influence each 
other's behavior, we are able to quantify the performance of every node in the network and to derive 
closed-form expressions for the mean-square performance at small step-sizes — see ( 106 l-( T08| ). Finally, 
as a special case, we are able to show that constant step-sizes can still ensure that the estimates across 
all nodes converge to the optimal w° and reach agreement in the absence of noise — see Theorem |2] 



Motivated by |31|, we address the mean-square-error performance of the adaptive ATC and CTA 
diffusion strategies (p4|)-([25|) by treating them as special cases of a general diffusion structure of the 
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following form: 



N 



4>k,i~l = y^Pl,;,fcW;, 
1=1 

V'fc.i = (l)k,i-l - /"fc X] ^^'f" ^wJli4>k,i-l) + VLi{4>k,i-l) 



N 



N 



Wk,i = ^P2,l,ktl^, 



Li 



(26) 
(27) 
(28) 



The coefficients and {p2,/,A;} are nonnegative real coefficients corresponding to the {I, k}- 

th entries of three matrices Pi, 5, and P2, respectively. Different choices for {^1,^2,5'} correspond to 
different cooperation modes. For example, the choice Pi = I, P2 = I and S = I corresponds to the 
non-cooperative case where nodes do not interact. On the other hand, the choice Pi = I, P2 = A = [a; fc] 



and S = C = [ci^k] corresponds to ATC |29|-|31|, while the choice Pi = A, P2 = I and S = C 



corresponds to CTA ||26|-| 3 1 1 . We can also set 5 = / in ATC and CTA to derive simplified versions 
that have no gradient exchange f2Si\. Furthermore, if in CTA {P2 = I), we enforce Pi = ^ to be doubly 
stochastic, set S = I, and use a time-decaying step-size parameter {^k{i) — ^ 0), then we obtain the 



unconstrained version used by |39|, |43|. The matrices {Pi,P2,5} are required to satisfy: 



Pf 1 = 1, P^t = 1, 51 = 1 



(29) 



where the notation 1 denotes a vector whose entries are all equal to one. 



A. Error Recursions 



We first derive the error recursions corresponding to the general diffusion formulation in ([26|)-p8 1. 
Introduce the error vectors: 



(t>k,i = W°-(f)k,i, IpkA = W°-ll)k,i, Wk,i = W°-Wk,i 



Then, subtracting both sides of (26l-(28l from w" gives: 



N 

4>k,i^i = y^^pi,i,k'wi^i-i 
1=1 

N 



'ipk,i = 4>k,i-l + M/c X] ^'''^ ^wJl{(tik,i-l) + Vl^i{4>k, 



■,i-l 



1=1 



N 



Wk,i = '^P2,l,k'4'l,i 



(30) 

(31) 
(32) 
(33) 
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Expression (32 1 still includes terms that depend on 4>k,i-i and not on the error quantity, (pk^i-i- We 



can find a relation in terms of (ftk^i-i by calling upon the following result from |44 p.24] for any 
twice-differentiable function /(•): 



V/(y) = Vf{x) + 



j\^f(x+t{y-x))dt 



{y - x) 



(34) 



where V^/(-) denotes the Hessian matrix of /(•) and is symmetric. Now, since each component function 



Ji{w) has a minimizer at w", then, VwJi{w°) = for I = 1,2, . . . , N. Applying (|34]) to Ji{w) using 
X = w" and y = ^k,i~i, we get 



— —Hi^k,i-l4>k,i-l 

where we are introducing the symmetric random matrix 



0, 



kA-1 



Hi 



LkA-l 



vIjAw" -t4>k,i-i)dt 



(35) 



(36) 



Observe that one such matrix is associated with every edge linking two nodes {l,k); observe further that 



this matrix changes with time since it depends on the estimate at node k. Substituting p5])-p6|) into p2] ) 
leads to: 



N 



k,i-l 



1=1 



^k,i-l 



N 



+ tJ'k'^Sl^kVl,ii4> 



kA—lj 



(37) 



We introduce the network error vectors, which collect the error quantities across all nodes: 



4>i,i 




^l,i 








7 A 




~ A 
Wi = 




4>N,i 




'4>N,i 







(38) 
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and the following block matrices: 

Vi = Pi Im, V2 = P2^ Im (39) 
S =S®hu M = Vl®Im (40) 
VL = diag{^i, . . . , /xat} (41) 

N 

T^i-i = ^diaglsi^ii^i^i^i-i, • • • , Si^Ari^i^AT^i-ij (42) 
1=1 

N 

9i = ^col|si,itii,i((^i,^i), • • • ,s^,^r^|^,^(0Ar,^_l)| (43) 
1=1 

where the symbol denotes Kronecker products |50|. Then, recursions ( |3T] ), ([37]) and ( |33] ) lead to: 

Wi = V^[Imn - M'D,^i]V'(wi-i + V^Moi (44) 



To proceed with the analysis, we introduce the following assumption on the cost functions and gradient 
noise, followed by a lemma on iJ^ ^ 

Assumption 1 (Bounded Hessian). Each component cost function Ji{w) has a bounded Hessian matrix, 
i.e., there exist nonnegative real numbers Ai.min <^nd Aj^max such that Aj^min ^ A/^max <^nd that for all w: 

(45) 

Furthermore, the {M,mm}iLi satisfy 

N 

^Sl,kXl,min>0,k = l,2,...,N (46) 

1=1 



Condition (46) ensures that the local cost functions {J].°^(tf)} defined earlier in (|4]) are strongly convex 



and, hence, have a unique minimizer at w°. 

Assumption 2 (Gradient noise). There exist a > and cr^ > such that, for all w G J-i-i and for all 

i, I: 

E{^;;,i(tu) I = (47) 

E{||-j;i,i(i(;)f} < a'&Ww" - wf + al (48) 

where Ti-\ denotes the past history (o— field) of estimates {w^j} for j <i — 1 and all k. ■ 
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Lemma 1 (Bound on iJ; ^ Under Assumption^ the matrix Hi^k,i-i defined in (36 1 is a nonnegative- 
definite matrix that satisfies: 



(49) 



Proof: It suffices to prove that Aj^min < Hi^k,i-ix < A^^max for arbitrary M x 1 unit Euclidean norm 



vectors x. By (36 1 and (45 1, we have 



X HlkA-lX 



x^Vl^Ji ( w° - t4>k,i-i ) X dt 



JO 



In a similar way, we can verify that x Hi^k,i-ix > A^^] 



In distributed subgradient methods (e.g., |15|, |39|, |43|), the norms of the subgradients are usually 



required to be uniformly bounded. Such assumption is restrictive in the unconstrained optimization of 
differentiable functions. Assumption [T] is more relaxed in that it allows the gradient vector VwJi{w) 



to have unbounded norm (e.g., quadratic costs). Furthermore, condition ( |48) allows the variance of the 
gradient noise to grow no faster than E||t(;° — tup. This condition is also more general than the uniform 
bounded assumption used in ||39| (Assumptions 5.1 and 6.1), which requires instead: 

E\Mw)f<al E{\Mw)f\T,^i}<al (50) 



(51) 



Furthermore, condition ( [481 ) is similar to condition (4.3) in |51 p.635]: 

E{\\v^iw)f\T^-i} < a[\\V^Jiiw)f + l 

which is a combination of the "relative random noise" and the "absolute random noise" conditions defined 
44 pp.100-102]. Indeed, we can derive ( |48l ) by substituting (35l into (51_l, taking expectation with 



m 



respect to J^i-i, and then using (49 1. 



Example 2. Such a mix of "relative random noise" and "absolute random noise" is of practical impor- 
tance. For instance, consider an example in which the loss function at node / is chosen to be of the 
following quadratic form: 

Qi{w, {ui^i, di{i)}) = \di{i) - ui^iw\'^ 
for some scalars {di{i)} and 1 x M regression vectors {ui i}. The corresponding cost function is then: 



Ji{w) = E\di{i) - ui^iw\ 



(52) 
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Assume further that the data {ui^i, di{i)} satisfy the Unear regression model 

Mi) = ui^iW° + zi{i) 



(53) 



where the regressors {ui i} are zero mean and independent over time with covariance matrix R^^ i = 
E{ti^jii/^j}, and the noise sequence {zk{j)} is also zero mean, white, with variance a1 and independent 



of the regressors for all /, k, Then, using ([53]) and ( [23] ), the gradient noise in this case can be 

expressed as: 



vi,i{w) = 2{Ruj - uliUii){w° -w) - 2u[iZi{i) 



(54) 



It can easily be verified that this noise satisfies both conditions stated in Assumption [2| namely, ( [47] ) and 
also: 



E{\\vi4w)f} 



(55) 



for all w G -Fj-i. Note that both relative random noise and absolute random noise components appear in 



(55 1 and are necessary to model the statistical gradient perturbation even for quadratic costs. Such costs. 



and linear regression models of the form ( |53] ), arise frequently in the context of adaptive filters — see, 

e.g., i9|, pa-pg, i36|, m, pg-pg. ■ 



Example 3. Quadratic costs of the form ( [521 ) are common in mean-square-error estimation for linear 
regression models of the type ([53]). If we use the instantaneous approximations as is common in the context 



of stochastic approximation and adaptive filtering |44|, |46|, |52|, then the actual gradient Vii;Ji{w) can 
be approximated by 



VwJiiw) = VwQi{w, {ui^i, di{i)}) 
= -2ufi[di{i) - ui^iw] 



(56) 



Substituting into (24i-(25 1, and assuming C7 = / for illustration purposes only, we arrive at the following 



ATC and CTA diffusion strategies originally proposed and extended in p6|-| 3 1 1 for the solution of 
distributed mean-square-error estimation problems: 



(ATC) 



= iffc,i-i + 2/ifcttfc^j[dfe(i)-iifc,iiUfc,i-i] 



(57) 



May 15, 2012 



DRAFT 



17 



(CTA) 



Wk,i = '0fc,i-l+2/ifcwL[dfc(z)-tifc,i'0A:,i-l] 



(58) 



B. Variance Relations 

The purpose of the mean-square analysis in the sequel is to answer two questions in the presence 
of gradient perturbations. First, how small the mean-square error, E||ii;fc gets as i — )• oo for any 
of the nodes k. Second, how fast this error variance tends towards its steady-state value. The first 
question pertains to steady-state performance and the second question pertains to transient/convergence 
rate performance. Answering such questions for a distributed algorithm over a network is a challenging 
task largely because the nodes influence each other's behavior: performance at one node diffuses through 
the network to the other nodes as a result of the topological constraints linking the nodes. The approach 
we take to examine the mean-square performance of the diffusion algorithms is by studying how the 
variance E||ii3fc j|p, or a weighted version of it, evolves over time. As the derivation will show, the 
evolution of this variance satisfies a nonlinear relation. Under some reasonable assumptions on the noise 
profile, and the local cost functions, we will be able to bound these error variances as well as estimate 
their steady-state values for sufficiently small step-sizes. We will also derive closed-form expressions that 
characterize the network performance. The details are as follows. 

Equating the squared weighted Euclidean norm of both sides of ( |44] ), applying the expectation operator 
and using using (|47]), we can show that the following variance relation holds: 



ills 



E\\wi\\l = E\^\\w^^i\g,^+E\\V^Mg, 
^' = Vi[lMN-MT>i^i]V2^VUhiN-MT>.i^i]Vl 



(59) 



where S is a positive semi-definite weighting matrix that we are free to choose. The variance expression 
(59 1 shows how the quantity E||i(;j||| evolves with time. Observe, however, that the weighting matrix 



on tUj-i on the right-hand side of (|59]) is a different matrix, denoted by S', and this matrix is actually 



random in nature (while S is deterministic). As such, result ( |59| ) is not truly a recursion. Nevertheless, 
it is possible, under a small step-size approximation, to rework variance relations such as ( |59l ) into a 
recursion by following certain steps that are characteristic of the energy conservation approach to mean- 



square analysis |46|. 

The first step in this regard would be to replace 51' by its mean ES'. However, the matrix S' depends on 
the {Hi^k,i-i} via T>i-i (see ^42\). It follows from the definition of Hi^k,i-i in ( |36l ) that S' is dependent 
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on 0fc,j-i as well, which in turn is a linear combination of the {wi^i^i}. Therefore, the main challenge to 



continue from ([59]) is that S' depends on Wi-i. For this reason, we cannot apply directly the traditional 
step of replacing T,' in the first equation of ([59]) by ES' as is typically done in the study of stand-alone 
adaptive filters to analyze their transient behavior ||46l, p.345]; in the case of conventional adaptive filters, 
the matrix is independent of tUj-i. To address this difficulty, we shall adjust the argument to rely 
on a set of inequality recursions that will enable us to bound the steady-state mean-square-error at each 
node — see Theorem [T] further ahead. 

The procedure is as follows. First, we note that is a convex function of x, and that expressions 



(31 1 and ( [33] ) are convex combinations of {w^ and {ipi^i}, respectively. Then, by Jensen's inequality 



1 56 p.77] and taking expectations, we obtain 



N 



E\\wk 



1=1 

N 

1=1 



(60) 
(61) 



for k = 1, . . . , N. Next, we derive a variance relation for ( [37) . Equating the squared Euclidean norms of 
both sides of ( |37] ), applying the expectation operator, and using ( [47] ) from Assumption ]2] we get 



E\\^Pk,if=E\\\4)k,^-l\\L,_ 



E 



N 



i~l, 



1=1 



where 



■ikA-l 



N 



kA-1 



1=1 



(62) 



(63) 



We call upon the following two lemmas to bound (62i. 



Lemma 2 (Bound on Ylk^i-i). The weighting matrix Sfc,i-i defined in (63 1 is a symmetric, positive 
semi-definite matrix, and satisfies: 



< ^k,^-l < IpM 



where 





N 




N 




7fc = max i 




1 




1 




1=1 




1=1 





(64) 



(65) 
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Proof: By definition (63l and the fact that Hi^k,i-i is symmetric — see definition (36), the matrix 
iM—fJ-k S/li si,kHi^k,i-i is also symmetric. Hence, its square, Xl^ is symmetric and also nonnegative- 



definite. To establish (64), we first use (49i to note that: 



N / N \ 

1=1 \ 1=1 ) 

N / ^ \ 

1=1 \ 1=1 ) 



(66) 



(67) 



The matrix 1m — fJ-k "^fLi si,kH^k,i-i may not be positive semi-definite because we have not specified 



a range for /i^ yet; the expressions on the right-hand side of ([66l)-([67]) may still be negative. However, 
inequalities (|66l)-(|67]) imply that the eigenvalues of Im — l^k J2iLi si,kHi^k,i-i are bounded as: 



N 



N 



A I /m - /^fc ^ si,kHi^k,i-i 1 > 1 - ^fc ^ si,k^ 
\ 1=1 J 1=1 

/ N \ N 

A I /m — /^fc ^ si^kHi^k,i-i I < 1 — ^fc ^ si^kM 



i,max 



(68) 



(69) 



1=1 



1=1 



By definition ([63]), ^k,i-i is the square of the symmetric matrix lu—fJ-k "^iLi si,kHi^k,i-i, meaning that 



A (5]fc,i-i) 



TV 



A I hi — fJ-k ^ si^kHi^k,i-i 



1=1 



> 



(70) 



Substituting (68l-(69l into (70 1 leads to 



< max ■ 
which is equivalent to (|64]). 



AT 



1 - /»fc sf.fcA, 



i,max 



1=1 



N 



1 - /^fc ^ Si,fcA 



i,min 



1=1 



(71) 



Lemma 3 (Bound on noise combination). The second term on the right-hand-side of ([62]) satisfies: 



E 



TV 



si,kvi,i{4>k,i~-i) 



1=1 



< \\S\\l- aE\\4>k,i-if+al 



(72) 



where \\S\\i denotes the 1-norm of the matrix S (i.e., the maximum absolute column sum). 
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Proof: Applying Jensen's inequality |[56} p.77], it holds that 

N 2 

E ' 

1=1 



1=1 

N 



SLk 



N 

1=1 1^1=1 ^^.k 



-Vl,ii(l)k,i-l) 



N 



< 



1=1 

N 



''''' nviA^k,i-i)f 



1=1 1^1=1 ^l,k 



1=1 



1=1 ■ ' 

N 

1 



A;,i-lJ 



aE||0fc,j-i||^ + o-^ 



1=1 



l|2 I ^2 



(73) 
(74) 



where inequality ( [73] ) follows by substituting ( [48] ), and ( [74| ) is obtained using the fact that ||5||i is the 
maximum absolute column sum and that the entries {si^k} are nonnegative. ■ 



Substituting ( |64| ) and ( |72] ) into ([62]), we obtain: 

KlIV'Mf < {ll+f^la\\S\\l)-EUk,i-if+f^l \\S\\l (75) 
for k = 1,...,N. Finally, introduce the following network mean-square-error vectors (compare with 



(38i) 



n4>iA\^ 








Mwi4^ 




, yi = 
















IE||'u3Ar,i|P 



and the matrix 



r = diag{7i +^ia||5||?, ... + fi%a\\S\\l} 



(76) 



Then, (|60l)-(|6Tl) and (T5\ can be written as 

p^i-i ^ Pim-i 



yi<Tx,.i + cjl\\s\\lnH 



(77) 
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where the notation x < y denotes that the components of vector x are less than or equal to the 
corresponding components of vector y. We now recall the following useful fact that for any matrix 
F with nonnegative entries, 

x<y^ Fx<Fy (78) 
This is because each entry of the vector Fy — Fx = F{y — x) is nonnegative. Then, combining all three 



inequalities in (77i leads to: 



(79) 



C. Mean-Square Stability 



Based on < \T9\ , we can now prove that, under certain conditions on the step-size parameters {/ife}, the 
mean-square-error vector Wi is bounded as i — )• oo, and we use this result in the next subsection to 
evaluate the steady-state MSE for sufficiently small step-sizes. 



Theorem 1 (Mean-Square Stability). If the step-sizes {/ifc} satisfy the following condition: 







\ 


al +a\\S\\V ctI ■ +a 

K,max ' II 111 k,mm ' 


\s\\l 



(80) 



for k = 1, . . . ,N, where CTfc. max <^nd crfc^min ^^re defined as 

N 



N 



=1 



max) <7fe,min — / , •^(,fc^>J,mm 
1=1 



,1 



(81) 



then, as i ^ oo, 



limsup ||>Vi||oo < 



max tii ] ■ ||5||?cr,^ 
l<A:<Ar ^ 



1— max (^u + Ui,a\\S\\^] 
i<k<N " ' 



(82) 



where 11 x||oo denotes the maximum absolute entry of vector x. 

Proof: See Appendix |A] ■ 
If we let a=0 and (T^=0 in Theorem [T] and examine the arguments leading to it, we conclude the 



validity of the following result, which establishes the convergence of the diffusion strategies ([24|-p5) 
in the absence of gradient noise (i.e., using the true gradient rather than stochastic gradient — see ( 18 1 
and ([19])). 
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Theorem 2 (Convergence in Noise-free Case). If there is no gradient noise, i.e., a = and = 0, then 



the mean-square-error vector becomes the deterministic vector Wj = col{||it;i^j 
entries converge to zero if the step-sizes {fik} satisfy the following condition: 



0< fik< 



|'"^Ar,i|P}, and its 



(83) 



for k = 1, . . . ,N, where cTfc.max was defined in (81 



We observe that, in the absence of noise, the deterministic error vectors, Wk,i, will tend to zero as 
i — oo even with constant (i.e., non- vanishing) step-sizes. This result implies the interesting fact that, in 
the noise-free case, the nodes can reach agreement without the need to impose diminishing step-sizes. 



D. Steady-State Performance 

Expression ( |80l ) provides a condition on the step-size parameters {nk} to ensure the mean-square 

gives an upper bound on 



stability of the diffusion strategies (|24|)-(|25|). At the same time, expression 
how large Wi can be at steady-state. Since the oo-norm of a vector is defined as the largest absolute value 
of its entries, then ([82]) bounds the MSE of the worst-performing node in the network. We can derive 
closed-form expressions for MSEs when the step-sizes are assumed to be sufficiently small. Indeed, we 
first conclude from ([82]) that for step-sizes that are sufficiently small, each Wk^i will get closer to w° at 
steady-state. To verify this fact, assume the step-sizes are small enough so that the nonnegative factor 7^ 



that was defined earlier in ([65]) becomes 

7fc = 1 - 



N 

,1 

(=1 



1 



where ak^min was given by ( [81] ). Substituting ([84]) into ([82]), we obtain: 

limsup llWilloo 



(84) 



max III. 

l<k<N 



< 



I Cl|2^2 



1- max { {l-fikCTk 

l<k<N 



,min 



max 111, 

l<k<N 



I Cl|2^2 



< 



mm < Ilk 



l^k{(^k,mm + "ll'S'll?) 
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< 



mill < 2o-fcmm-Atfc(o-fcmin + «ll'S'lll) 
l<fc<Af ' 



/^max 
/^min 



(85) 



where 



l<fc<Af 



/imm= mm /ifc 

l<fc<Af 



For sufficiently small step-sizes, the denominator in ( [851 ) can be approximated as 

2crfe,min — A'fcCo'fc min + '^ll'S'lll) ~ '^(^k,mm 



Substituting into (85 1, we get 



limsup ||Wi 



< 



I S'l|2n-2 
\>^\\l^v 



_ ^max 

2 mill C^j rnin /^min 
l<fc<Af ' 



(86) 



(87) 



(88) 



Therefore, if the step-sizes are sufficiently small, the MSB of each node becomes small as well. This 
result is clear when all nodes use the same step-sizes such that /Umax = fJ-min = Then, the right-hand 



side of (88 1 is on the order of 0{fi), as indicated. It follows that {wk^i} are small in the mean-square-error 
sense at small step-sizes, which also means that the mean-square value of is small because it is 

a convex combination of {wk.i} (recall (pT)). Then, by definition (36l, in steady-state (for large enough 
i), the matrix i/; ^ j_i can be approximated by: 



H, 



,k,i-i- [ V^Ji{w")dt = V^Ji{w") 
Jo 



(89) 



In this case, the matrix Hi,k,i-i is not random anymore and is not dependent on the error vector 
Accordingly, in steady-state, the matrix X>j_i that was defined in ( |42] ) is not random anymore and it 
becomes 



N 








1=1 





As a result, in steady-state, the original error recursion (44i can be approximated by 



Taking expectations of both sides of ( |9T| ), we obtain the following mean-error recursion 

Ewi = Vj[lMN - MVoo]Vf -Ewi-i, i 
which converges to zero if the matrix 



oo 



(90) 



(91) 



(92) 



B^r^[lMN-MVoo]Vf 



(93) 
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is stable. The stability of B can be guaranteed when the step-sizes are sufficiently small (or chosen 
according to ( [80l )) — see the proof in Appendix |C] Therefore, in steady-state, we have 



lim Kwi 



(94) 



Next, we determine an expression (rather than a bound) for the MSE. To do this, we need to evaluate the 



covariance matrix of the gradient noise vector grj. Recall from (43l that Qi depends on {0^ which 
is close to w° at steady-state for small step-sizes. Therefore, it is sufficient to determine the covariance 
matrix of at w°. We denote this covariance matrix by: 



■ N 

^col|si,ii»i,i(u;°), 



Sl,NVl,i[W 



1=1 

N 



1} 



.1=1 



(95) 



In practice, we can evaluate from the expressions of {vi i{w")}. For example, for the case of the 



quadratic cost (52i, we can substitute (54i into (95l to evaluate Ry. 



Returning to the last term in the first equation of ( [59| ), we can evaluate it as follows: 

E 1 1 V^Mgi 111= EgfMV2 ^V^Mgi 

= TV {j:V^ME{gigf}MV2) 
= Tr {j:V^MRvMV2) 
Using (90 1, the matrix S' in (59 1 becomes a deterministic quantity as well, and is given by: 

S' ~ Vi[Imn - MV^]V2^V^[Imn - MV^]Vl 



(96) 



(97) 



Substituting ( [96] ) and ( [97] ) into ( [59] ), an approximate variance relation is obtained for small step-sizes: 

E\\wi\\l ~ Mwi-iWh + Tr {T.V^MRyMV2) (98) 
S' « Vi [Imn-MV^]V2^V^ [Imn-MV^]vI (99) 

Let a = vec(S) denote the vectorization operation that stacks the columns of a matrix S on top of each 
other. We shall use the notation and interchangeably to denote the weighted squared Euclidean 



norm of a vector. Using the Kronecker product property |57 p.l47]: vec{UY,V) = {V <^U)vec{T;), we 



can vectorize S' in ([99]) and find that its vector form is related to S via the following linear relation: 
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a' = vec(S') K, Ta, where, for sufficiently small steps-sizes (so that higher powers of the step-sizes can 



be ignored), the matrix T is given by 



T^[Vx [Imn-MVoo]V2 )0{Vi [Imn -MV^\P2 



(100) 



Here, we used the fact that M. and Poo are block diagonal and symmetric. Furthermore, using the property 
Tr(SX) = vec(X^)^cr, we can rewrite ([98]) as 



MwiWl «E||-!i),_i||3r^+ [wec{v'^MR^MV2)]^(J 



(101) 



It is shown in |46 pp. 344-346] that recursion ( |101[ ) converges to a steady-state value if the matrix F 
is stable. This condition is guaranteed when the step-sizes are sufficiently small (or chosen according to 
(80l) — see Appendix [C| Finally, denoting 



Ellwooll^ = lim 



(102) 



and letting z — )■ oo, expression (101 ) becomes 



^\woo\\Ta + [vec {V'^MRvMV2)f a 



so that 



K||ti;oo||(/_^), ~ [wec{V^MR,MV2)f a 



(103) 



Expression ( |103 1 is a useful result: it allows us to derive several performance metrics through the proper 
selection of the free weighting parameter a (or S). First, to be able to evaluate steady-state performance 
metrics from ( |103| ), we need (/ — F) to be invertible, which is guaranteed by the stability of matrix F 
— see Appendix |C] Given that (/ — J^) is a stable matrix, we can now resort to ( |103| ) and use it to 
evaluate various performance metrics by choosing proper weighting matrices S (or a), as it was done in 



1 31 1 for the mean-square-error estimation problem. For example, the MSE of any node k can be obtained 
by computing E||ii;oo||y with a block weighting matrix T that has an identity matrix at block {k, k) and 
zeros elsewhere: 



lE||tf>A;,oo|| =E||lt). 



2 

oo llT 



Denote the vectorized version of this matrix by t^, i.e., 

tk = vec(diag(efc) Im) 



(104) 



(105) 
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where is a vector whose kth entry is one and zeros elsewhere. Then, if we select a in ( 103 1 as 
a = {I — F)~^t}i, the term on the left-hand side becomes the desired E||t()fc oolP and MSE for node k 
is therefore given by: 

MSEfc ^ [vec {vjMR^MV2)f {I - J^)~hk 



(106) 



This value for MSE^ is actually the kih entry of Woo defined as 



lim Wi 

i— >oo 



(107) 



Then, we arrive at an expression for Woo (as opposed to the bound for it in (82 1, as was explained earlier; 



expression ( 108 1 is derived under the assumption of sufficiently small step-sizes): 



Wo 



' [In^ ( [vec {V^MR^MV2)] ^ (I-J^y^)} t 



(108) 



where t = col{ti, . . . ,t]\f}. If we are interested in the network MSE, then the weighting matrix of 
ElltOoolll- should be chosen as T = Imn/N. Let q denote the vectorized version of ImNi i-C-) 



q = vec(lMAf) 



(109) 



and select a in ( 103 1 as a = (/— J-") q/N. The network MSE is then given by 




(110) 



The approximate expressions (108 1 and ( |1 10| ) hold when the step-sizes are small enough so that (90 1 
holds. In the next section, we will see that they are consistent with the simulation results. 



V. Simulation Results 

In this section we illustrate the performance of the diffusion strategies (|24])-(|25]) by considering two 
applications. We consider a randomly generated connected network topology with a cyclic path. There 
are a total of = 10 nodes in the network, and nodes are assumed connected when they are close 
enough geographically. In the simulations, we consider two applications: a regularized least-mean-squares 
estimation problem with sparse parameters, and a collaborative localization problem. 
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A. Distributed Estimation with Sparse Data 

Assume each node k has access to data {Uk,i, d^^i}, generated according to the following model: 

dk,i = Uk,iW° + Zk,i (111) 

where {Uk,i} is a sequence of K x AI i.i.d. Gaussian random matrices. The entries of each Uk,i have 
zero mean and unit variance, and z^^i ~ M{0, ctIIr) is the measurement noise that is temporally and 
spatially white and is independent of J7; j for all k,l,i,j. Our objective is to estimate w° from the data 
set {Uk,i, dk^i} in a distributed manner. In many applications, the vector w° is sparse such as 

w" = [10 ... Olf eR^ 



One way to search for sparse solutions is to consider a global cost function of the following form pS 



N 



J^^°^iw) = y^E\\d^-Uuw\\l + pRiw) (112) 



1=1 



where R{w) and p are the regularization function and regularization factor, respectively. A popular 



choice is R{w) = \\w\\i, which helps enforce sparsity and is convex |58|-|63|. However, this choice 
is non-differentiable, and we would need to apply sub-gradient methods (44^, pp. 138-144] for a proper 
implementation. Instead, we use the following twice-differentiable approximation for \\w\\i: 

M 

R{w) = 7RI+^ (113) 

m=l 

where [w]m denotes the m-th entry of w, and e is a small number. We see that, as e goes to zero. 



R{w) II If 111- Obviously, R{w) is convex, and we can apply the diffusion algorithms to minimize ( 112 i 
in a distributed manner. To do so, we decompose the global cost into a sum of N individual costs: 

Ji{w) = E\\d^ - Uuw\\l + j^RH (114) 

for / = 1, . . . , A^. Then, using algorithms (J^l and ^T9\ , each node k would update its estimate of w° by 
using the gradient vectors of {Ji{w)}i^x^, which are given by: 

V^Jiiw) = 2E {U^iUi,i) W-2E {UlA,i) 

+ ^V^R{w) (115) 

However, the nodes are assumed to have access to measurements {Ui^i,di^k} and not to the second- 
order moments ¥,(^Uf-Ui^i^ and ¥,(^Uf-di^i^ . In this case, nodes can use the available measurements to 
approximate the gradient vectors in (|24]) and ( |25| ) as: 

V^Jiiw) = 2Ul, [Ui^iW-d^] + ^V^R{w) (116) 
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where 



V,„R(w) 



[w\i 



[W M 



w 



M 



(117) 



In the simulation, we set M = 50, K = 5, = 1, and w° = [1 ... 1]^. We apply both diffusion and 
incremental methods to solve the distributed learning problem, where the incremental approach |[6|-||9| 
uses the following construction to determine Wi: 

• Start with i/'o,i = it'i-i at the node at the beginning of the incremental cycle. 

• Cycle through the nodes k = 1, . . . , N: 



(118) 



• Set Wi ^ 

• Repeat. 

The results are averaged over 100 trials. The step-sizes for ATC, CTA and non-cooperative algorithms are 
set to ^ = 10^^, and the step-size for the incremental algorithm is set to ^ = 10^"^/A^. This is because the 
incremental algorithm cycles through all N nodes every iteration. We therefore need to ensure the same 
convergence rate for both algorithms for a fair comparison f35l. For ATC and CTA strategies, we use 
simple averaging weights for the combination step, and for ATC and CTA with gradient exchange, we use 



Metropolis weights for {ci^k} to combine the gradients (see Table Illin 1 31 1 for the definitions of averaging 



weights and Metropolis weights). We use expression (110) to evaluate the theoretical performance of the 



diffusion strategies. As a remark, expression ( |1 10| ) gives the MSB with respect to the minimizer of the 
cost js'°^(tt;) in ( 1 12 1. In this example, the minimizer of the cost ( 1 12[ ), denoted as w°, is biased away 
from the model parameter w° in ( |111| ) when the regularization factor 7 7^ 0. To evaluate the theoretical 
MSB with respect to w°, we use 

N 



MSD = lim — VElItu" 

j-s>oo N ^ 



k=l 



+ .lim 



N 



WkA 



(119) 



fc=i 



where the second term in ( 1 19 1 can be evaluated by expression (|1 10|) with w° replaced by w°. Moreover, 



in the derivation of ( 119 ), we used the fact that limj__5.oo — Wk^i) = to eliminate the cross term, and 
this result is due to (94) with w° there replaced by w°. Fig. 2(a)| shows the learning curves for different 
algorithms for 7 = 2 and e = 10^'^. We see that the diffusion and incremental schemes have similar 
performance, and both of them have about 10 dB gain over the non-cooperation case. To examine the 
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Fig. 2. Transient and steady-state performance of distributed estimation witli sparse parameters. 



impact of the parameter e and the regularization factor 7, we show the steady-state MSE for different 



values of 7 and e in Fig. 2(b) When e is small (e = 10 adding a reasonable regularization (7 = 1 ~ 4) 



decreases the steady-state MSE. However, when e is large (e = 1), expression ( |1 13| ) is no longer a good 
approximation for \\w\\i, and regularization does not improve the MSE. 

B. Distributed Collaborative Localization 



The previous example deals with a convex cost (112i. Now, we consider a localization problem that 



has a non-convex cost function and apply the same diffusion strategies to its solution. Assume each node 
is interested in locating a common target located at w° = [0 O]-'^. Each node k knows its position Xk and 
has a noisy measurement of the squared distance to the target: 

dk{i) = \\w° -Xkf + Zk{i), k = 1,2,...,N 

where Zk{i) ~ AA(0,(T^^) is the measurement noise of node k at time i. The component cost function 
Jk{w) at node k is chosen as 

Jkiw) = ^E\dk{i)-\\w-Xkff (120) 

where we multiply by 1 /4 here to eliminate a factor of 4 that will otherwise appear in the gradient. If each 
node k minimizes Jk{w) individually, it is not possible to solve for w". Therefore, we use information 
from other nodes, and instead seek to minimize the following global cost: 

N 

(121) 



1 ^ 

J^'°''{w) = -^E\dk{i)-\\w-Xkf\' 



4 

k=l 
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This problem arises, for example, in cellular communication systems, where multiple base-stations are 
interested in locating users using the measured distances between themselves and the user. Diffusion 



algorithms ( |T8) and ( \19\ can be applied to solve the problem in a distributed manner. Each node k would 
update its estimate of w° by using the gradient vectors of {Ji{w)}i,=j\f^, which are given by: 

V^J;(if) = -E,di{i) (w - xi) + \\w - xi\\^{w - xi) (122) 

However, the nodes are assumed to have access to measurements {di{i),xi} and not to E<i;(i). In this 
case, nodes can use the available measurements to approximate the gradient vectors in ([24]) and ( [25] ) as: 

^wJi{w) = -di{i){w - xi) + \\w - xi\\'^{w - xi) (123) 

If we do not exchange the local gradients with neighbors, i.e., if we set S = I, then the base-stations 
only share the local estimates of the target position w" with their neighbors (no exchange of {xi}i^_\fj- 



We first simulate the stationary case, where the target stays at w°. In Fig. 3(a) we show the MSE curves 
for non-cooperative, ATC, CTA, and incremental algorithms. The noise variance is set to cj^^ = 1. We 
set the step-sizes to /x = 0.0025/A^ for the incremental algorithm, and /i = 0.0025 for other algorithms. 
For ATC and CTA strategies, we use simple averaging for the combination step {ai,fc}, and for ATC 
and CTA with gradient exchange, we use Metropolis weights for {ci k} to combine the gradients. The 
performance of CTA and ATC algorithms are close to each other, and both of them are close to the 



incremental scheme. In Fig. 3(b) we show the steady state MSE with respect to different values of /x. 
We also use expression ( 110[ ) to evaluate the theoretical performance of the diffusion strategies. As the 
step-size becomes small, the performances of diffusion and incremental algorithms are close, and the 
MSE decreases as fi decreases. Furthermore, we see that exchanging only local estimates {S = I) is 
enough for localization, compared to the case of exchanging both local estimates and gradients (S = C). 
Next, we apply the algorithms to a non-stationary scenario, where the target moves along a trajectory. 



as shown in Fig. 3(c) The step-size is set to = 0.01 for diffusion algorithms, and to ^ = 0.01/A^ 
for the incremental approach. To see the advantage of using a constant step-size for continuous tracking, 
we also simulate the vanishing step-size version of the algorithm from |39|, |43| {nk,i = 0.01/z). The 



diffusion algorithms track well the target but not the non-cooperative algorithm and the algorithm from 



1 39 1, 1 43 1, because a decaying step-size is not helpful for tracking. The tracking performance is shown 
in Fig. |3(d) 
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CTA (S=I) 
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Non-cooperative 
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(a) Learning curves for stationary target (/i — 0.0025). 



[~] ■ Target Trajectory 
— H — Incremental 
-0-ATC(S=l) 
-e-CTA(S=l) 
— 1— ATC(S=C) 
-»— CTA(S=C) 

Approach in [39,43] 
Non-cooperative 
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Step-size |i 
(b) Steady-state performance for stationary target. 



-2 2 
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Number of Iterations 
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(c) Tracking a moving-target by node 1 (fi — 0.01). 



(d) Learning curves for moving target (fi — 0.01). 



Fig. 3. Performance of distributed localization for stationary and moving targets. Diffusion strategies employ constant step-sizes, 
which enable continuous adaptation and learning even when the target moves (which corresponds to a changing cost function). 



VI. Conclusion 

This paper proposed diffusion adaptation strategies to optimize global cost functions over a network 
of nodes, where the cost consists of several components. Diffusion adaptation allows the nodes to 
solve the distributed optimization problem via local interaction and online learning. We used gradient 
approximations and constant step-sizes to endow the networks with continuous learning and tracking 
abilities. We analyzed the mean-square-error performance of the algorithms in some detail, including 
their transient and steady-state behavior. Finally, we applied the scheme to two examples: distributed 
sparse parameter estimation and distributed localization. Compared to incremental methods, diffusion 
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Strategies do not require a cyclic path over the nodes, which makes them more robust to node and link 
failure. 

Appendix A 
Proof of Mean-Square Stability 



Taking the oo— norm of both sides of (79 1, we obtain 



\\yVi\\oo < ||^'2^rPf ||oo-||Wi-l||oo+CT2||5||M|P2^||oo-|l^^llL 
— 11-^2 I loo ■ ||r||oo ■ 1 1 -Pi I loo ■ ||VVj-_i||oo 
+ '^^ll'S'lll ■ l|-P2"l|oo • II^^IlL 

= ||r||oo-||W^-i||oo + ( max fil)-al\\S\\l (124) 

V l<k<N / 

where we used the fact that ||-Pj^||oo = llP'^lloo = 1 because each row of and sums up to one. 



Moreover, from (76 1, we have 



l°° = + f^k<^\\S\\i) (125) 

1<A;<A' 



Iterating ( |124| ), we obtain 

llWiiloo < l|r||^-||Wo||c 



i-i 

+ (^max^/xi) -alWSWl^WTWi, (126) 

j=0 



We are going to show further ahead that condition (80 1 guarantees ||r||oo < 1- Now, given that ||r||oo < 1, 



the first term on the right hand side of ( |126| ) converges to zero as f — )• oo, and the second term on the 



right-hand side of (126 1 converges to 



v2||c||2 



.lima.2||S||?j;||r|P^ = -^4^ (127) 

J— >-CXD ^ ' I — II I II 

j=0 



Therefore, we establish ( [82| ) as follows: 

^.'ll^ll? 



limsup llWilloo < I max //^ 



i<k<N "J l-||r||oo 

2^2 

(128) 



max 111. ■ ||5||-|(T„ 
1— max (7^ + /i?a||5||? 



May 15, 2012 



DRAFT 



33 

The only fact that remains to prove is to show that (80 1 ensures ||r||oo < 1- From (125 1, we see that 
the condition ||r||oo < 1 is equivalent to requiring: 

^l + lila\\S\\l<l, k = l,...,N. (129) 



Then, using ([65]), this is equivalent to: 

N 2 

1 - /^fc^s«,fcA/,max) +ijla\\S\\l<l (130) 



1=1 

N 



1 - /^fc^S^^fcA/^min) +/ifca||5||i < 1 (131) 



1=1 



for k = 1, . . . , N. Recalling the definitions for cr^ max and ak^mm in ( |8T] > and solving these two quadratic 
inequalities with respect to /i^, we arrive at: 

„ 2(Tfc max „ 2cr/j jnin 

<max + «ll'5|lf <min + "ll^lll 



and we are led to ( 80 1 



Appendix B 
Block Maximum Norm of a Matrix 

Consider a block matrix X with blocks of size M x M each. Its block maximum norm is defined as 



\W\\ A -^2; boo /m\ 

||A llft^oo = max — — (132) 

where the block maximum norm of a vector x = col{xi, . . . ,xn}, formed by stacking N vectors of size 
M each on top of each other, is defined as [35J: 

ll^^lkoo— max \\xf:\\ (133) 

l<k<N 

where || • || denotes the Euclidean norm of its vector argument. 

Lemma 4 (Block maximum norm). If a block diagonal matrix X = diagjXi, . . . , X^} G K^^^x^^ 
consists of N blocks along the diagonal with dimension M x M each, then the block maximum norm of 
X is bounded as 

\\X\\b,oo < max \\Xk\\ (134) 

l<fc<Af 

in terms of the 2-induced norms o/ {X^} (largest singular values). Moreover, if X is symmetric, then 



equality holds in ( |134[ ) 
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Proof: Note that Xx = col{XiXi,. . . ,Xnxn}- Evaluating the block maximum norm of vector Xx leads 
to 



\Xx\ 



max llXfcXfcll 

l<fc<Af 



< max llXfcll • \\xk\\ 

l<k<N 

< max • max 



l<fe<Af 



l<fc<Af 



(135) 



Substituting ( |135| ) and ( |133| ) into ( |132| ), we establish ( |134| ) as 

l-^'^ II 6,00 



X 



b,oo 



max — 



< max 



maxi<fc<jv ll-'^fcll • maxi<fc<Ar \\xk\ 



max llXfcl 
i<fc<Af 



(136) 



Next, we prove that, if all the diagonal blocks of X are symmetric, then equality should hold in ( |136 1. 
To do this, we only need to show that there exists an xq 7^ 0, such that 

||^a;o||b,oo 

ll'^^O II 6.00 



max \\X]^\ 
l<k<N 



(137) 



which would mean that 



\X\ 



\Xx\ 



b,oo 



b,oo 



> 



max — 

Xy^Q Uplift, 00 

ll^a^ollfe^oo 
II -^0 life, 00 

max llXfcll 

l<A;<Af 



Then, combining inequalities ( 136 1 and (138 1, we would obtain desired equality that 



l-'^llb.oo = max \\Xk\\ 

l<k<N 



(138) 



(139) 



when X is block diagonal and symmetric. Thus, without loss of generality, assume the maximum in 



(137 1 is achieved by Xi, i.e.. 



max ||Xfc| 

l<A;<Af 



1X1 



For a symmetric X^, its 2-induced norm \\Xk\\ (defined as the largest singular value of X^) coincides with 
the spectral radius of Xk- Let Aq denote the eigenvalue of Xi of largest magnitude, with the corresponding 
right eigenvector given by zq. Then, 

max ||Xfc|| = |Ao|, XiZq = XqZq 

l<k<N 
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We select xq = col{zo, 0, . . . , 0}. Then, we establish ( |137| ) by: 

||^a;o||6,oo ||col{XiZo, 0, . . . , 0}||6,oo 



||col{zo,0,...,0}||fe, 
ll^i^oll IIAo^oll 



|Ao| = max llXfcll 

1<K<A' 



Appendix C 
Stability of B and J" 



Recall the definitions of the matrices B and T from (93 1 and (100): 



B = V^[lMN-MV^]Vf 



= b'^^b'^ 

From (fT40l)-(fT4T]), we obtain (see Theorem 13.12 from fST. p.l41]): 

p{:F) = p{B^(B>B'') = [piB'')f = [p{B)f 



(140) 



(141) 



(142) 



where /?(•) denotes the spectral radius of its matrix argument. Therefore, the stability of the matrix T is 
equivalent to the stability of the matrix B, and we only need to examine the stability of B. Now note 
that the block maximum norm (see the definition in Appendix iBll of the matrix B satisfies 



l^llb.oo < WhiN — ■M.'Doo\\b,oo 



since the block maximum norms of Vi and 1^2 are one (see |35 p.4801]): 



1 



(143) 



(144) 



1 ll6,oo ' II 2 llb^oo 

Moreover, by noting that the spectral radius of a matrix is upper bounded by any matrix norm (Theorem 
5.6.9, iBOl p.297]) and that Imn — MV^ is symmetric and block diagonal, we have 



MN 



MV. 



oo b,oo 



Pil 



MN 



(145) 



Therefore, the stability of B is guaranteed by the stability of Imn — AdVoo- Next, we call upon the 
following lemma to bound \\Imn — -MVooWi, ^o- 



Lemma 5 (Norm of Imn—-M'Doo)- It holds that the matrix Vao defined in (90l satisfies 

||/MAf-A^/^oo||fcoo ^ max 7fc 

l<k<N 



(146) 
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where 7^ is defined in ( 65 1 



Proof: Since Dqo is block diagonal and symmetric, Imn — MT^oo is also block diagonal with blocks 
{/a/ — /XfcDfc oo}» where Pfe^oo denotes the fcth diagonal block of 2?oo- Then, from ( |134| ) in Lemma |4] in 
Appendix |B] it holds that 

WhiN-MV^W^ = max ||/M-/"fc^5fc,oo|| (147) 



By the definition of V^o in (90 1, and using condition (45 1 from Assumption [TJ we have 



which implies that 



■TV \ /TV \ 

,min ,max 
.1=1 / \ 1=1 J 

/TV \ 
hi - fJ'kT^k,oo > I 1 ~ A'fc X] ■'^^,kh,ma.x 1 " Im (148) 

/TV \ 

Im - /xfcPfc,oo < ( 1 - A^fc ^ s/,fcAi,min 1 • Itw (149) 



Thus, \\Im — fJ'k'^k ooW^lk- Substituting into ( 147 1, we get ( 146 1. 



Substituting ( 146 1 into ( 145 1, we get: 



p(e)< max 7fc (150) 

l<fe<TV 

As long as max 71. < 1, then all the eigenvalues of B will he within the unit circle. By the definition 

l<A,<Af 



of 7j!; in ( |65| ), this is equivalent to requiring 

|1 — A'fe'^fc maxi < 1) |1 — /^A:0'fc,min| < 1 



for k = 1, . . . ,N, where fx^ ^ax and o"jt min are defined in ( |8T] ). These conditions are satisfied if we 
choose fik such that 

< /ifc < 2/crfc,max, k = l,...,N (151) 



which is obviously guaranteed for sufficiently small step-sizes (and also by condition (80l). 
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